@dqkqd
Contributor

@dqkqd dqkqd commented Oct 26, 2025

Which issue does this PR close?

Rationale for this change

with_param_values doesn't update the parameters' types in the schema when it is used on an EmptyRelation.
As a result, SELECT $1, $2 has an incorrect schema after substitution.

For example: after replacing $1 = 1, $2 = "s", the schema is [Null, Null], but it should
be [Int64, Utf8].

This schema type mismatch is resolved before converting to physical plan by
the type_coercion rule in the analyzer.

.map_data(|plan| plan.recompute_schema())

So I'm not sure whether we should fix it in with_param_values.

What changes are included in this PR?

  • recompute the schema after replacing param values.
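
To illustrate the problem in isolation, here is a minimal std-only sketch. The `DataType`, `Expr`, and `Select` types and the `with_param_values_stale` method are hypothetical stand-ins invented for this example, not DataFusion's actual API. The sketch only assumes that a plan node caches its schema and that an unbound placeholder has type `Null`.

```rust
/// Simplified stand-ins for DataFusion's types (hypothetical, not the real API).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Null,
    Int64,
    Utf8,
}

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// `$1`, `$2`, ... (1-based); the type is unknown until a value is bound.
    Placeholder(usize),
    Int64(i64),
    Utf8(String),
}

impl Expr {
    fn data_type(&self) -> DataType {
        match self {
            Expr::Placeholder(_) => DataType::Null,
            Expr::Int64(_) => DataType::Int64,
            Expr::Utf8(_) => DataType::Utf8,
        }
    }
}

/// A toy `SELECT <exprs>` node that caches its schema, as a logical plan does.
#[derive(Debug, Clone, PartialEq)]
struct Select {
    exprs: Vec<Expr>,
    schema: Vec<DataType>,
}

impl Select {
    fn new(exprs: Vec<Expr>) -> Self {
        let schema = exprs.iter().map(Expr::data_type).collect();
        Select { exprs, schema }
    }

    /// Replace placeholders with values but keep the cached schema (the bug).
    /// Placeholders without a matching parameter are left untouched.
    fn with_param_values_stale(mut self, params: &[Expr]) -> Self {
        for e in self.exprs.iter_mut() {
            if let Expr::Placeholder(i) = *e {
                if let Some(v) = i.checked_sub(1).and_then(|k| params.get(k)) {
                    *e = v.clone();
                }
            }
        }
        self
    }

    /// Rebuild the cached schema from the current expressions (the fix).
    fn recompute_schema(mut self) -> Self {
        self.schema = self.exprs.iter().map(Expr::data_type).collect();
        self
    }
}
```

In this toy model, binding `$1 = 1` and `$2 = "s"` leaves the cached schema at `[Null, Null]` until `recompute_schema` is called, after which it becomes `[Int64, Utf8]`.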

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@github-actions github-actions bot added logical-expr Logical plan and expressions core Core DataFusion crate labels Oct 26, 2025
@dqkqd dqkqd marked this pull request as ready for review October 26, 2025 09:45
Jefffrey previously approved these changes Oct 27, 2025
Contributor

@Jefffrey Jefffrey left a comment

Makes sense to me 👍

Do you think it's also possible to add the queries from the original issue as tests?

@dqkqd
Contributor Author

dqkqd commented Oct 27, 2025

Thanks @Jefffrey

I tried to add a test for the subquery alias query SELECT a, b FROM (VALUES ($1, $2)) AS t(a, b), similar to the one in the issue:

let df = ctx.sql("SELECT a, b FROM (VALUES ($1, $2)) AS t(a, b)").await?;
let df_with_params_replaced = df.with_param_values(vec![
    ScalarValue::UInt32(Some(1)),
    ScalarValue::Utf8(Some("foofy".to_string())),
])?;
dbg!(df_with_params_replaced.collect().await?[0].schema());
#> Error: ArrowError(InvalidArgumentError("column types must match schema types, expected Null but found UInt32 at column index 0"), Some(""))

But the test still fails. This is the plan after replacing params (the schema for t is [Null, Null]):

        Projection: t.a, t.b [a:Null;N, b:Null;N]
          SubqueryAlias: t [a:Null;N, b:Null;N]
            Projection: column1 AS a, column2 AS b [a:Null;N, b:Null;N]
              Values: (Int32(1) AS $1, Utf8("s") AS $2) [column1:Null;N, column2:Null;N]

The expected plan:

        Projection: t.a, t.b [a:Int32;N, b:Utf8;N]
          SubqueryAlias: t [a:Int32;N, b:Utf8;N]
            Projection: column1 AS a, column2 AS b [a:Int32;N, b:Utf8;N]
              Values: (Int32(1) AS $1, Utf8("s") AS $2) [column1:Int32;N, column2:Utf8;N]

I think it needs more work, so I'm converting this to a draft for now.

@dqkqd dqkqd marked this pull request as draft October 27, 2025 19:15
@dqkqd dqkqd force-pushed the with-param-values-incorrect-type-schema branch 4 times, most recently from ba80e38 to ac1c953 Compare November 1, 2025 08:05
@dqkqd dqkqd force-pushed the with-param-values-incorrect-type-schema branch from ac1c953 to 599bed8 Compare November 1, 2025 08:26
@github-actions github-actions bot added documentation Improvements or additions to documentation sql SQL Planner physical-expr Changes to the physical-expr crates optimizer Optimizer rules substrait Changes to the substrait crate functions Changes to functions implementation physical-plan Changes to the physical-plan crate labels Nov 2, 2025
@dqkqd dqkqd force-pushed the with-param-values-incorrect-type-schema branch from 6a8297e to 8e038e1 Compare November 2, 2025 01:39
@github-actions github-actions bot removed documentation Improvements or additions to documentation sql SQL Planner physical-expr Changes to the physical-expr crates optimizer Optimizer rules substrait Changes to the substrait crate functions Changes to functions implementation physical-plan Changes to the physical-plan crate labels Nov 2, 2025
@dqkqd dqkqd force-pushed the with-param-values-incorrect-type-schema branch from 8e038e1 to bb76688 Compare November 2, 2025 01:46
@github-actions github-actions bot added the substrait Changes to the substrait crate label Nov 2, 2025
@dqkqd dqkqd force-pushed the with-param-values-incorrect-type-schema branch from 38177db to d3cc940 Compare November 2, 2025 02:03
@dqkqd dqkqd changed the title fix: with_param_values on EmptyRelation returns incorrect schema fix: with_param_values on LogicalPlan::EmptyRelation and LogicalPlan::Values returns incorrect schema Nov 2, 2025
Contributor Author

@dqkqd dqkqd left a comment

I made some changes to handle the VALUES-based relation from the issue.

Summary:

  • Always call recompute_schema after replacing placeholders, to ensure parent nodes can see changes in the child node's schema.

  • Recompute the schema for LogicalPlan::Values: we currently reuse the old schema for LogicalPlan::Values here, so a change in the values' data types is never reflected in the schema.

I tried to use an slt test but couldn't find a way to EXPLAIN a plan after PREPARE.


/// Test for https://github.com/apache/datafusion/issues/18102
#[tokio::test]
async fn test_query_parameters_in_values_list_relation() -> Result<()> {
Contributor Author

The issue mentioned two queries.
This tests the first one: SELECT a, b FROM (VALUES ($1, $2)) AS t(a, b)
The second one is verified in the above test: SELECT $1, $2

Comment on lines -220 to -222
if n_cols == 0 {
    return plan_err!("Values list cannot be zero length");
}
Contributor Author

I use LogicalPlanBuilder::values to recompute the schema for LogicalPlan::Values,
which causes this check to fail the test from #12339.

Since we allow values without columns, I think this line should be removed.

Comment on lines +651 to +659
// `old_field`'s data type is unknown but `new_field`'s is known
if old_field.data_type().is_null()
    && !new_field.data_type().is_null()
{
    let field = old_field
        .as_ref()
        .clone()
        .with_data_type(new_field.data_type().clone());
    (table_ref.cloned(), Arc::new(field))
Contributor Author

This is the main change: the new schema needs new_field's data type.

What about other attributes?

I tried to apply new_field's nullability, but the test in insert.slt:305 failed,
because its schema requires NOT NULL while new_field is nullable.

CREATE TABLE table_without_values(field1 BIGINT NOT NULL, field2 BIGINT NULL);

-- statement error Invalid argument error: Column 'column1' is declared as non-nullable but contains null values
insert into table_without_values values(NULL, 300);

So the new schema shouldn't use new_field's nullability.
However, I'm not sure whether other attributes (e.g. metadata) need to be propagated.

Contributor

I'm still trying to wrap my head around what's happening here. From what I can tell, there are two things to consider here:

  1. When recomputing new_plan, we lose the column names from the old plan (self), so this is a way of keeping them
  2. If we ignore the nullability from self and go with new_plan's nullability, does that also run into other issues? (Though from what I see this may still cause an error, just later in the pipeline and with a different message)

Am I correct in my understanding here?

I wonder for point 1 if we can do this in a nicer way, perhaps by introducing something like LogicalPlanBuilder::values_with_names to preserve the column names from the start instead of amending after the fact.

For point 2 I'm not sure, perhaps an explicit check upfront is better 🤔

Contributor Author

@dqkqd dqkqd Nov 5, 2025

Thank you for looking at this.

Am I correct in my understanding here?

Correct.
Allow me to further explain why we need self and why we don't need the whole self.

We need self:

  • Point 1: the new plan uses different column names, which is clearly an issue.
  • Point 2: the new plan uses different nullability, which allows
    INSERT INTO non_null_table VALUES (NULL) to pass (even though it shouldn't),
    because the non-nullable self schema is replaced with the nullable new plan schema.

We don't need the whole self:
Let's use the SQL from this issue, SELECT a, b FROM (VALUES ($1, $2)) AS t(a, b), as a reference.
After replacing params:

  • self (t(a,b)) doesn't know about a and b data types.
  • new plan (VALUES ($1, $2)) knows about the data types, but doesn't know about a and b.

So, I was trying to apply the new data types while keeping everything else in self intact.
It was basically just these two lines, but since I couldn't find anything similar to
merge_data_type, I ended up rebuilding the whole schema.

LogicalPlan::Values(Values { schema, values }) => {
+               let new_plan = // compute new plan here
+               schema.merge_data_type(new_plan.schema());
                Ok(LogicalPlan::Values(Values { schema, values }))
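
For reference, the merge described above could look roughly like the following std-only sketch. `merge_data_type` does not exist in DataFusion; the `Field` and `DataType` types here are simplified, hypothetical stand-ins for Arrow's, and the sketch assumes both field lists have the same length and column order.

```rust
/// Hypothetical, simplified types; DataFusion uses Arrow's `Field`/`DataType`.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Null,
    Int32,
    Utf8,
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

/// Take the data type from `new` only where `old` is still `Null`.
/// Name and nullability always come from `old`.
/// Assumes `old` and `new` describe the same columns in the same order;
/// extra fields on either side are ignored by `zip`.
fn merge_data_type(old: &[Field], new: &[Field]) -> Vec<Field> {
    old.iter()
        .zip(new.iter())
        .map(|(o, n)| {
            let mut merged = o.clone();
            if o.data_type == DataType::Null && n.data_type != DataType::Null {
                merged.data_type = n.data_type.clone();
            }
            merged
        })
        .collect()
}
```

With `old = [a: Null NOT NULL, b: Null]` and `new = [column1: Int32, column2: Utf8]`, the result keeps the names `a` and `b` and the NOT NULL constraint on `a`, while picking up `Int32` and `Utf8`.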

Comment on lines +666 to +670
let schema = DFSchema::new_with_metadata(
    qualified_fields,
    schema.metadata().clone(),
)?
.with_functional_dependencies(schema.functional_dependencies().clone())?;
Contributor Author

I created a new schema here because I didn't know how to modify the fields of an existing schema.

Please let me know if we have APIs for modifying a schema's fields (then some clones could be avoided).

Comment on lines +1513 to +1515
// always recompute the schema to ensure that changes in the schema's fields are
// propagated to the plan's parent
.map_data(|plan| plan.recompute_schema())
Contributor Author

This needs to be computed even when the plan itself isn't transformed.
Otherwise, changes in a child's schema cannot be propagated to
its ancestors.
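
A toy model of why every ancestor must be refreshed, as a std-only sketch: the `Plan` enum and `recompute_schema_up` are hypothetical names for this example and do not mirror DataFusion's tree-rewriting API. It assumes inner nodes simply pass their child's column types through.

```rust
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Null,
    Int32,
    Utf8,
}

/// Toy plan node (hypothetical). A leaf carries its own column types; an
/// inner node caches a copy of its child's schema, like an unmodified
/// Projection or SubqueryAlias would.
#[derive(Debug)]
enum Plan {
    Values { types: Vec<DataType> },
    PassThrough { input: Box<Plan>, schema: Vec<DataType> },
}

impl Plan {
    fn schema(&self) -> &[DataType] {
        match self {
            Plan::Values { types } => types,
            Plan::PassThrough { schema, .. } => schema,
        }
    }

    /// Post-order walk: refresh the child first, then this node's cached
    /// schema. It runs on every node, including nodes that were not
    /// rewritten themselves.
    fn recompute_schema_up(&mut self) {
        if let Plan::PassThrough { input, schema } = self {
            input.recompute_schema_up();
            *schema = input.schema().to_vec();
        }
    }
}
```

If only the leaf were refreshed after parameter replacement, both `PassThrough` nodes above it would still report `[Null, Null]`; the post-order walk is what carries `[Int32, Utf8]` up to the root.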


// replaced
assert_snapshot!(plan.display_indent_schema(), @r"
Projection: t.a, t.b, t.c [a:Int32;N, b:Int32;N, c:Int64;N]
Contributor Author

Without the code change, this line would be:

        Projection: t.a, t.b, t.c [a:Int32;N, b:Null;N, c:Int64;N]

@dqkqd dqkqd marked this pull request as ready for review November 2, 2025 06:09
@Jefffrey Jefffrey self-requested a review November 3, 2025 03:48
@Jefffrey Jefffrey dismissed their stale review November 3, 2025 03:52

New changes

@Jefffrey
Contributor

Jefffrey commented Nov 4, 2025

I aim to review this tomorrow (or at least sometime this week)

Labels

core Core DataFusion crate logical-expr Logical plan and expressions substrait Changes to the substrait crate

Development

Successfully merging this pull request may close these issues.

SubqueryAlias, Values, and/or EmptyRelation have incorrect schemas after replacing Placeholder values

2 participants